-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IGNITE-22662 : Snapshot check as distributed process #11391
IGNITE-22662 : Snapshot check as distributed process #11391
Conversation
…DistrProc # Conflicts: # modules/core/src/test/java/org/apache/ignite/internal/processors/cache/persistence/snapshot/IgniteClusterSnapshotCheckTest.java
...a/org/apache/ignite/internal/processors/cache/persistence/snapshot/SnapshotCheckProcess.java
Outdated
Show resolved
Hide resolved
IgniteSnapshotManager snpMgr = kctx.cache().context().snapshotMgr(); | ||
|
||
if (allRestoreHandlers) { | ||
workingFut = CompletableFuture.supplyAsync(() -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should specify the snapshot executor for the async job
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid not if we use the same executor somewhere inside the task. I tried. The executor might be confgured with single thred. This thread is blocked with the waiting task. No thread left for the workers. Tests like testChangeSnapshotTransferRateInRuntime()
hangs.
...a/org/apache/ignite/internal/processors/cache/persistence/snapshot/SnapshotCheckProcess.java
Outdated
Show resolved
Hide resolved
...a/org/apache/ignite/internal/processors/cache/persistence/snapshot/SnapshotCheckProcess.java
Outdated
Show resolved
Hide resolved
void interrupt(Throwable err) { | ||
contexts.forEach((snpName, ctx) -> { | ||
if (ctx.fut != null) | ||
ctx.fut.onDone(err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible interrupt
will not interrupt anything. You still have a period of time:
- First phase don't start and futures are nulls.
- First phase finished, second phase still not started and futures are nulls.
I again recommend you create a future that lives on every node during all process and phase futures listen it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interrupt()
is a node stopping. SnapshotManager stops, its thread pool stops. Discovery should not work, process phases should start and be able to work because SnapshotManager stops. No problem is expected. Also, keeping single future requires resseting it. The same race. Canceling a finished future does nothing. Even not soring the exception and a canceled
flag. reset()
would revive future and restart it.
...a/org/apache/ignite/internal/processors/cache/persistence/snapshot/SnapshotCheckProcess.java
Show resolved
Hide resolved
if (ctx.fut != null) | ||
ctx.fut.onDone(err); | ||
|
||
it.remove(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you remove context here? Reduce phase is already responsible for clean up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we do not store error any more. We cannot recognize that we should not work at the reduce. The phase won't work if there is no context. If it is, the phase works. Also because stop future here.
SnapshotCheckContext ctx; | ||
|
||
// The context can be null, if a required node leaves before this phase. | ||
if (!req.nodes().contains(kctx.localNodeId()) || (ctx = context(null, req.requestId())) == null || ctx.locMeta == null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does ctx.locMeta == null
cover case !req.nodes().contains(kctx.localNodeId())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. We've added client-initiator to the required nodes. It has no meta. Previously, the required nodes were only data nodes. Also snapshot might be placed in any manner to the baseline nodes and/or be restored from another cluster. Any node may miss snapshot meta. Test testRestoreFromAnEmptyNode()
show case like that.
...a/org/apache/ignite/internal/processors/cache/persistence/snapshot/SnapshotCheckProcess.java
Outdated
Show resolved
Hide resolved
...a/org/apache/ignite/internal/processors/cache/persistence/snapshot/SnapshotCheckProcess.java
Outdated
Show resolved
Hide resolved
* | ||
* @param err The interrupt reason. | ||
*/ | ||
void interrupt(Throwable err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is invoked after IgniteSnapshotManager#busyLock
acquired. But checking snapshot doesn't check this lock. Looks like all this collections aren't synchronized with stopping node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method shoud be be renamed to 'onStop()'. Everything is stopping. SnapshotManager is stopping, thread pools is stopping. Discovery should not accept or process messages. No another check process must be able to start. Even it starts, it must not be able to work due to stopping thread pools. No problems are excpected.
...a/org/apache/ignite/internal/processors/cache/persistence/snapshot/SnapshotCheckProcess.java
Outdated
Show resolved
Hide resolved
...a/org/apache/ignite/internal/processors/cache/persistence/snapshot/SnapshotCheckProcess.java
Outdated
Show resolved
Hide resolved
|
||
// A not required node can leave the cluster and its result can be null. | ||
return results.entrySet().stream() | ||
.filter(e -> requiredNodes.contains(e.getKey()) && e.getValue() != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we already check requiredNodes in assert in the reduceValidatePartsAndFinish
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Results can contain nulls, come from not required, not a data nodes. NPEs will arise.
...n/java/org/apache/ignite/internal/processors/cache/persistence/snapshot/SnapshotChecker.java
Outdated
Show resolved
Hide resolved
boolean skipPartsHashes | ||
) { | ||
try { | ||
return checkPartitions(meta, snpDir, groups, forCreation, checkParts, skipPartsHashes).get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, let's invoker calls get()
with timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have no any timeout on snapshot validation. User cannot define it. Even can't cancel yet. There is no any value to pass.
catch (IgniteCheckedException e) { | ||
throw new IgniteException("Failed to check partitions of snapshot '" + meta.snapshotName() + "'.", e); | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
snapshot executor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. The check funtions uses it, the same executor. If it configured with just 1 thread (setSnapshotThreadPoolSize(1)
), we'll freeze here. Tests like testChangeSnapshotTransferRateInRuntime()
would hang.
No description provided.